Demographic Inference and Representative Population Estimates from Multilingual Social Media Data
Social media provide access to behavioural data at an unprecedented scale and
granularity. However, using these data to understand phenomena in a broader
population is difficult due to their non-representativeness and the bias of
statistical inference tools towards dominant languages and groups. While
demographic attribute inference could be used to mitigate such bias, current
techniques are almost entirely monolingual and fail to work in a global
environment. We address these challenges by combining multilingual demographic
inference with post-stratification to create a more representative population
sample. To learn demographic attributes, we create a new multimodal deep neural
architecture for joint classification of age, gender, and organization-status
of social media users that operates in 32 languages. This method substantially
outperforms current state of the art while also reducing algorithmic bias. To
correct for sampling biases, we propose fully interpretable multilevel
regression methods that estimate inclusion probabilities from inferred joint
population counts and ground-truth population counts. In a large experiment
over multilingual heterogeneous European regions, we show that our demographic
inference and bias correction together allow for more accurate estimates of
populations and make a significant step towards representative social sensing
in downstream applications with multilingual social media.
Comment: 12 pages, 10 figures, Proceedings of the 2019 World Wide Web Conference (WWW '19).
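The post-stratification step described above can be sketched in a few lines. Everything here is hypothetical for illustration: the demographic cells, the counts, and the direct share-ratio weighting (the paper itself estimates inclusion probabilities with multilevel regression rather than this simple ratio).

```python
# Hedged sketch of post-stratification reweighting. Cells and counts
# are invented; the ratio of population share to sample share stands
# in for the paper's regression-based inclusion probabilities.

def poststratify(sample_counts, population_counts):
    """Weight per demographic cell so that weighted sample totals
    match the known population distribution."""
    pop_total = sum(population_counts.values())
    samp_total = sum(sample_counts.values())
    weights = {}
    for cell, n in sample_counts.items():
        pop_share = population_counts[cell] / pop_total
        samp_share = n / samp_total
        weights[cell] = pop_share / samp_share  # up/down-weight the cell
    return weights

# Younger users are over-represented in this hypothetical sample:
sample = {"18-29": 700, "30-49": 200, "50+": 100}
population = {"18-29": 200, "30-49": 400, "50+": 400}
weights = poststratify(sample, population)
reweighted = {c: sample[c] * weights[c] for c in sample}
# reweighted cell totals now match the population counts
```

After reweighting, any downstream statistic computed as a weighted sum over users reflects the target population's demographic mix rather than the platform's.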
The importance of neutral examples for learning sentiment
Most research on learning to identify sentiment ignores “neutral” examples, learning only from examples of significant (positive or negative) polarity. We show that it is crucial to use neutral examples in learning polarity for a variety of reasons. Learning from negative and positive examples alone will not permit accurate classification of neutral examples. Moreover, the use of neutral training examples in learning facilitates better distinction between positive and negative examples.
Authorship Verification as a one-class classification problem
In the authorship verification problem, we are given examples of the writing of a single author and are asked to determine if given long texts were or were not written by this author. We present a new learning-based method for adducing the “depth of difference” between two example sets and offer evidence that this method solves the authorship verification problem with very high accuracy. The underlying idea is to test the rate of degradation of the accuracy of learned models as the best features are iteratively dropped from the learning process.
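The degradation test can be sketched in miniature. The vectors below are synthetic, and a leave-one-out nearest-centroid classifier stands in for the learned models of the paper; the point is only the shape of the curve, which collapses once the few strongest features are removed.

```python
# Toy sketch of the iterative feature-dropping ("unmasking") idea.
# Data are synthetic; nearest-centroid replaces the paper's models.

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def loo_accuracy(a, b):
    """Leave-one-out nearest-centroid accuracy on two vector sets."""
    correct = 0
    for i, v in enumerate(a):
        correct += dist(v, centroid(a[:i] + a[i + 1:])) < dist(v, centroid(b))
    for i, v in enumerate(b):
        correct += dist(v, centroid(b[:i] + b[i + 1:])) < dist(v, centroid(a))
    return correct / (len(a) + len(b))

def unmasking_curve(a, b, rounds=2):
    a = [list(v) for v in a]
    b = [list(v) for v in b]
    curve = []
    for _ in range(rounds):
        curve.append(loo_accuracy(a, b))
        # drop the single feature whose class means differ most
        diffs = [abs(x - y) for x, y in zip(centroid(a), centroid(b))]
        k = diffs.index(max(diffs))
        for v in a + b:
            del v[k]
    return curve

# Two samples that differ on one strong feature (index 0), as two works
# by the same author might: separability collapses once it is dropped.
a = [[3.0, 4.0], [3.0, 5.0], [3.0, 6.0]]
b = [[0.0, 4.0], [0.0, 5.0], [0.0, 6.0]]
curve = unmasking_curve(a, b)  # accuracy per round, e.g. [1.0, 0.0]
```

For genuinely different authors, many features carry signal, so accuracy degrades slowly instead; the rate of degradation is the verification cue.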
Using neutral examples for learning polarity
Sentiment analysis is an example of polarity learning. Most research on learning to identify sentiment ignores “neutral” examples and instead performs training and testing using only examples of significant polarity. We show that it is crucial to use neutral examples in learning polarity for a variety of reasons and show how neutral examples help us obtain superior classification results in two sentiment analysis test-beds.
Many machine-learning problems involve predicting an example's polarity: is it (significantly) greater than or less than some standard? One canonical example of learning polarity is sentiment analysis, the determination of whether a particular text expresses positive or negative sentiment regarding some issue. The problem of how to exploit a labeled corpus to learn models for sentiment analysis has attracted a good deal of interest in recent years [Dave et al. 2003, Pang et al. 2002, Shanahan et al. 2005]. One common characteristic of almost all this work has been the tendency to define the task as a two-category problem: positive versus negative. In almost all actual polarity problems, including sentiment analysis, there are, however, three categories that must be distinguished: positive, negative, and neutral. Not every comment on a product or experience expresses purely positive or negative sentiment. Some – in many cases, most – comments might report objective facts without expressing any sentiment, while others might express mixed or conflicting sentiment. Researchers are aware, of course, of the existence of neutral documents. The rationale for ignoring them has been a reliance on two tacit assumptions:
• Solving the binary positive vs. negative problem automatically solves the three-category problem, since neutral documents will simply lie near the boundary of the binary model.
• There is less to learn from neutral documents than from documents with clearly defined sentiment.
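The binary-versus-ternary distinction can be shown with a deliberately tiny lexicon classifier. The lexicon and documents are invented, and real systems learn such cues from labeled data, but the failure mode is the same: a two-class model must assign a polar label even to text with no sentiment cues at all.

```python
# Toy illustration (invented lexicon and documents) of why a binary
# positive/negative model cannot handle neutral text.

POS = {"great", "excellent", "love"}
NEG = {"terrible", "awful", "hate"}

def sentiment_score(text):
    words = text.lower().split()
    return sum(w in POS for w in words) - sum(w in NEG for w in words)

def binary_label(text):
    # forced choice: no neutral option exists in a two-class model
    return "positive" if sentiment_score(text) >= 0 else "negative"

def ternary_label(text):
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

doc = "the battery lasts ten hours"   # objective, sentiment-free
forced = binary_label(doc)            # misleadingly polar
correct = ternary_label(doc)          # "neutral"
```

The binary model labels the objective sentence as polar simply because it has nowhere else to put it, which is exactly the boundary assumption the abstract argues against.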
Exploiting Stylistic Idiosyncrasies for Authorship Attribution
Introduction. Early researchers in authorship attribution used a variety of statistical methods to identify stylistic discriminators: characteristics which remain approximately invariant within the works of a given author but which tend to vary from author to author (Holmes 1998, McEnery & Oakes 2000). In recent years, machine learning methods have been applied to authorship attribution; a few examples include (Matthews & Merriam 1993, Holmes & Forsyth 1995, Stamatatos et al. 2001, de Vel et al. 2001). Both the earlier "stylometric" work and the more recent machine-learning work have tended to focus on initial sets of candidate discriminators which are fairly ubiquitous. For example, the classical work of Mosteller and Wallace (1964) on the Federalist Papers used a set of several hundred function words, that is, words that are context-independent and hence unlikely to be biased towards specific topics. Other features used in even earlier work (Yule 1938) are complexity-based.
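Function-word features of this kind can be sketched very simply. The word list and example text below are illustrative only, far smaller than Mosteller and Wallace's actual several-hundred-word set.

```python
# Sketch of topic-neutral function-word features. The list is a tiny
# illustrative stand-in for a full function-word inventory.

FUNCTION_WORDS = ["the", "of", "and", "to", "upon", "while", "whilst"]

def function_word_profile(text):
    """Relative frequency of each function word: a stylistic
    fingerprint largely independent of topic."""
    tokens = text.lower().split()
    return [tokens.count(w) / len(tokens) for w in FUNCTION_WORDS]

profile = function_word_profile("The cat sat upon the mat")
# profile[0] is the rate of "the": 2 of 6 tokens
```

Because such words are context-independent, two texts on different topics by the same author tend to have closer profiles than two texts on the same topic by different authors, which is what makes them useful discriminators.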